Music genre classifier¶

Data exploration¶

This notebook should be seen as the first step in a series of notebooks aimed at eventually building an audio classifier.

Until now, I've never worked with audio files or signal processing. Parsing these data and seeing what features we can extract from them will be quite interesting. MFCC coefficients are the features most commonly used in ML models to classify audio, whether for music genre classification or speech recognition. MFCCs will be the final step of this notebook, though we will document our discoveries and learnings along the way.

Goal¶

Explore the GTZAN Music Genres dataset of audio files.

Dataset¶

The dataset contains 1000 audio tracks, each 30 seconds long. It covers 10 genres, each represented by 100 tracks. The tracks are all 22050 Hz Mono 16-bit audio files in .wav format. In other words, the data are sampled at 22050 Hz with 16 bits of resolution.

Source¶

https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification/ (accessed 2023-10-20)

Load¶

Load the GTZAN music genre dataset from the source mentioned above. The data is downloaded locally and unzipped to ../../../../data/gtzan_music_dataset/.

In [5]:
# audio specific imports
from scipy.fftpack import fft
from scipy import signal
import librosa as lr
import python_speech_features as psf
import scipy.io.wavfile as wav


from plotly.subplots import make_subplots
import IPython.display as ipd
import numpy as np
import os
import pandas as pd 
import plotly.express as ex 
In [6]:
data_dir = "../../../../data/gtzan_music_dataset/"
genre_dir = os.path.join(data_dir, "genres_original")
genres = os.listdir(genre_dir)
genres
Out[6]:
['pop',
 'metal',
 'disco',
 'blues',
 'reggae',
 'classical',
 'rock',
 'hiphop',
 'country',
 'jazz']
In [7]:
genre_dir
Out[7]:
'../../../../data/gtzan_music_dataset/genres_original'
In [8]:
def load_genre_features(genre_dir, verbose=1):
    # load every file from the dataset, grouped by genre
    genre_features = {}

    for i, folder in enumerate(os.listdir(genre_dir)):
        if verbose:
            print(i, folder)

        file_paths = []
        rates = []
        signals = []
        mfcc_feats = []
        covariances = []
        mean_matrices = []

        features = []
        labels = []

        for file in os.listdir(os.path.join(genre_dir, folder)):
            if verbose > 1:
                print(file)
            try:
                file_path = os.path.join(genre_dir, folder, file)
                (rate, sig) = wav.read(file_path)
                mfcc_feat = psf.mfcc(sig, rate, winlen=0.020, appendEnergy=False)  # appendEnergy=True would replace the 0th coefficient with the log of the total frame energy
                covariance = np.cov(mfcc_feat.T)  # (13, 13) covariance of the coefficients across frames
                mean_matrix = mfcc_feat.mean(0)  # mean of each coefficient along axis 0 (the time frames)
                feature = (mean_matrix, covariance, i)  # i is the genre label

                file_paths.append(file_path)
                rates.append(rate)
                signals.append(sig)
                mfcc_feats.append(mfcc_feat)
                covariances.append(covariance)
                mean_matrices.append(mean_matrix)

                features.append(feature)
                labels.append(i)

            except Exception as e:
                print("Error processing", file_path, "-", e)
                continue

        genre_features[folder] = {
            "file_paths": file_paths, 
            "rates":rates, 
            "signals": signals, 
            "mfcc_feats": mfcc_feats, 
            "covariances": covariances, 
            "mean_matrices": mean_matrices, 

            # actually useful for training
            "features": features,
            "labels": labels
        }

    return genre_features

genre_features = load_genre_features(genre_dir)
0 pop
1 metal
2 disco
3 blues
4 reggae
5 classical
6 rock
7 hiphop
8 country
9 jazz
Error processing ../../../../data/gtzan_music_dataset/genres_original/jazz/jazz.00054.wav
In [9]:
genre_features["pop"].keys()
Out[9]:
dict_keys(['file_paths', 'rates', 'signals', 'mfcc_feats', 'covariances', 'mean_matrices', 'features', 'labels'])

Feature exploration¶

This section is to get comfortable with the data we are handling, building the foundation for a wider analysis later.

In [10]:
genre_features["disco"]["rates"][0]  # all sample rates are the same 22050 Hz
Out[10]:
22050

Shapes¶

The dataset contains 1000 audio tracks each 30 seconds long. It contains 10 genres, each represented by 100 tracks. The tracks are all 22050Hz Mono 16-bit audio files in .wav format. During the loading stage, we converted the signals to MFCCs, then built covariance matrices, and aggregated the MFCCs to mean arrays. Let's take a look at the shape of a genre's signals.

In [11]:
# np.array(genre_features["disco"]["signals"]).shape
np.array(genre_features["pop"]["signals"]).shape  # 100 tracks, 661504 samples each
Out[11]:
(100, 661504)

Not all genres could be directly converted into an array, since their signals are not all the same length. We need to check how the feature shapes vary between genres.
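One way to handle the varying lengths later (a sketch with a hypothetical helper, not something we need just yet) would be to pad or truncate every signal to the nominal 661500 samples before stacking them into an array:

```python
import numpy as np

def pad_or_truncate(sig, target_len=661500):
    """Hypothetical helper: force a signal to a fixed length by
    truncating long signals and zero-padding short ones."""
    if len(sig) >= target_len:
        return sig[:target_len]
    return np.pad(sig, (0, target_len - len(sig)))

# e.g. a 661794-sample track gets cut down, a 660000-sample one padded
long_sig = np.ones(661794, dtype=np.int16)
short_sig = np.ones(660000, dtype=np.int16)
print(pad_or_truncate(long_sig).shape, pad_or_truncate(short_sig).shape)  # (661500,) (661500,)
```

With that, `np.stack` over any genre's signals would succeed regardless of the per-file lengths.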

In [12]:
def print_genre_features_shapes(genre, features):
    print(genre, "count:", len(features["rates"]))
    print(genre, "rates", " value:", features["rates"][0] )
    print(genre, "signals", "unique shapes:",  set(x.shape for x in features["signals"]))
    print(genre, "mfcc_feats", "unique shapes:",  set(x.shape for x in features["mfcc_feats"]))
    print(genre, "covariances", "unique shapes:",  set(x.shape for x in features["covariances"]))
    print(genre, "mean_matrices", "unique shapes:",  len(features["mean_matrices"]))
    print(genre, "labels", "value:", features["labels"][0])
    print("-"*50)


def print_all_genre_features_shapes(genre_features):
    for genre in genre_features.keys():
        print_genre_features_shapes(genre, genre_features[genre])


print_all_genre_features_shapes(genre_features)
pop count: 100
pop rates  value: 22050
pop signals unique shapes: {(661504,)}
pop mfcc_feats unique shapes: {(2993, 13)}
pop covariances unique shapes: {(13, 13)}
pop mean_matrices unique shapes: 100
pop labels value: 0
--------------------------------------------------
metal count: 100
metal rates  value: 22050
metal signals unique shapes: {(661504,), (661794,)}
metal mfcc_feats unique shapes: {(2994, 13), (2993, 13)}
metal covariances unique shapes: {(13, 13)}
metal mean_matrices unique shapes: 100
metal labels value: 1
--------------------------------------------------
disco count: 100
disco rates  value: 22050
disco signals unique shapes: {(664180,), (667920,), (661504,), (666160,), (661344,), (661760,), (661676,), (668140,), (665060,), (663520,), (661794,)}
disco mfcc_feats unique shapes: {(2992, 13), (3014, 13), (3009, 13), (2993, 13), (2994, 13), (3005, 13), (3022, 13), (3023, 13), (3002, 13)}
disco covariances unique shapes: {(13, 13)}
disco mean_matrices unique shapes: 100
disco labels value: 2
--------------------------------------------------
blues count: 100
blues rates  value: 22050
blues signals unique shapes: {(661794,)}
blues mfcc_feats unique shapes: {(2994, 13)}
blues covariances unique shapes: {(13, 13)}
blues mean_matrices unique shapes: 100
blues labels value: 3
--------------------------------------------------
reggae count: 100
reggae rates  value: 22050
reggae signals unique shapes: {(661504,), (661794,)}
reggae mfcc_feats unique shapes: {(2994, 13), (2993, 13)}
reggae covariances unique shapes: {(13, 13)}
reggae mean_matrices unique shapes: 100
reggae labels value: 4
--------------------------------------------------
classical count: 100
classical rates  value: 22050
classical signals unique shapes: {(661408,), (672282,), (661794,), (665280,), (661760,), (661676,), (670120,), (669680,), (663080,), (663520,), (661344,)}
classical mfcc_feats unique shapes: {(2992, 13), (3030, 13), (3042, 13), (3032, 13), (3010, 13), (2994, 13), (3000, 13), (3002, 13)}
classical covariances unique shapes: {(13, 13)}
classical mean_matrices unique shapes: 100
classical labels value: 5
--------------------------------------------------
rock count: 100
rock rates  value: 22050
rock signals unique shapes: {(661408,), (669460,), (667920,), (670340,), (661500,), (661794,)}
rock mfcc_feats unique shapes: {(2992, 13), (3022, 13), (2993, 13), (2994, 13), (3033, 13), (3029, 13)}
rock covariances unique shapes: {(13, 13)}
rock mean_matrices unique shapes: 100
rock labels value: 6
--------------------------------------------------
hiphop count: 100
hiphop rates  value: 22050
hiphop signals unique shapes: {(675808,), (661408,), (660000,), (664400,), (661504,), (665280,), (669240,), (661760,), (667700,), (668140,), (669680,), (661676,), (661794,)}
hiphop mfcc_feats unique shapes: {(2992, 13), (3030, 13), (3021, 13), (2986, 13), (3057, 13), (2993, 13), (3010, 13), (2994, 13), (3006, 13), (3028, 13), (3023, 13)}
hiphop covariances unique shapes: {(13, 13)}
hiphop mean_matrices unique shapes: 100
hiphop labels value: 7
--------------------------------------------------
country count: 100
country rates  value: 22050
country signals unique shapes: {(661408,), (668800,), (666820,), (663300,), (669680,), (661760,), (663740,), (661100,), (661794,)}
country mfcc_feats unique shapes: {(2992, 13), (3030, 13), (3003, 13), (3026, 13), (2994, 13), (3017, 13), (3001, 13), (2991, 13)}
country covariances unique shapes: {(13, 13)}
country mean_matrices unique shapes: 100
country labels value: 8
--------------------------------------------------
jazz count: 99
jazz rates  value: 22050
jazz signals unique shapes: {(667480,), (661980,), (665940,), (666820,), (669240,), (665280,), (661676,), (672100,), (661794,)}
jazz mfcc_feats unique shapes: {(3041, 13), (3020, 13), (2995, 13), (3010, 13), (2994, 13), (3017, 13), (3028, 13), (3013, 13)}
jazz covariances unique shapes: {(13, 13)}
jazz mean_matrices unique shapes: 99
jazz labels value: 9
--------------------------------------------------

Pop and blues will be the easiest to convert to dataframes, since their signals all have the same length. The others need some resizing and/or padding. I'll look into the distributions of the "pop" genre for now.

In [13]:
print_genre_features_shapes("pop", genre_features["pop"])
print_genre_features_shapes("blues", genre_features["blues"])
pop count: 100
pop rates  value: 22050
pop signals unique shapes: {(661504,)}
pop mfcc_feats unique shapes: {(2993, 13)}
pop covariances unique shapes: {(13, 13)}
pop mean_matrices unique shapes: 100
pop labels value: 0
--------------------------------------------------
blues count: 100
blues rates  value: 22050
blues signals unique shapes: {(661794,)}
blues mfcc_feats unique shapes: {(2994, 13)}
blues covariances unique shapes: {(13, 13)}
blues mean_matrices unique shapes: 100
blues labels value: 3
--------------------------------------------------
In [14]:
22050 * 30  # frequency * seconds
Out[14]:
661500

A 30 second track sampled at 22050 Hz contains 661500 samples, which should correspond to the number of signal values in a given song. We see in the shapes above that the files generally follow this rule, though some songs have a few hundred more samples than others.

Signal¶

The signal represents the amplitude of the audio file over time. I have no idea what the units are, and there seem to be mixed views online. I would need to talk with someone who knows these data better for a confident answer here.

In [15]:
genre_features["pop"]["signals"][:5]
Out[15]:
[array([ 1131,  1578,  2107, ..., -2279, -4733, -6299], dtype=int16),
 array([  -65,  2147,  4107, ...,  7470, 10374, 10001], dtype=int16),
 array([  405,  -352, -1036, ..., -6513,  1040,  1946], dtype=int16),
 array([-135, -457, -306, ..., 5473, 1978, 1510], dtype=int16),
 array([-3608, -5045,   851, ..., -6681, -8115, -8368], dtype=int16)]

Plot¶

In [16]:
first_pop_sig = genre_features["pop"]["signals"][0]

# for human readable x-axis
time_index = np.linspace(0, 30, len(first_pop_sig))
In [17]:
pd.options.plotting.backend = 'plotly'

pop_signals_df = pd.DataFrame(first_pop_sig, index=time_index)
pop_signals_df.plot(title="Signal - pop index=0", labels = {"index":"time (s)", "value":"Amp.", "variable":"signal"})
In [18]:
# we can listen to the signal
ipd.Audio(genre_features["pop"]["file_paths"][0])
Out[18]:
Your browser does not support the audio element.
In [19]:
def display_signal(signal, index=None):
    time_index = np.linspace(0, 30, len(signal))
    sig_df = pd.DataFrame(signal, index=time_index)
    sig_df.index.set_names("time", inplace=True)
    sig_df.plot(title=f"Signal - pop index={index}", labels = {"index":"time (s)", "value":"Amp.", "variable":"signal"}).show()


def play_song(file_path):
    ipd.display(ipd.Audio(file_path))


def display_n_signals(genre_features=genre_features, genre="pop", n=5):
    for x in range(n):
        print("-"*50)
        play_song(genre_features[genre]["file_paths"][x])
        display_signal(genre_features[genre]["signals"][x], x)

display_n_signals(genre_features, genre="pop", n=3)
--------------------------------------------------
Your browser does not support the audio element.
--------------------------------------------------
Your browser does not support the audio element.
--------------------------------------------------
Your browser does not support the audio element.

Signal processing¶

The first step of signal processing is to convert the signal to a frequency domain representation. This can be done using a FFT (Fast Fourier Transform) or a STFT (Short Time Fourier Transform). The FFT is a single transformation of the entire signal, while the STFT is a series of FFTs of smaller segments of the signal.
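Before applying the FFT to real audio, a quick sanity check on a synthetic signal (this tone is my own example, not part of the dataset): a pure 440 Hz tone sampled for exactly one second at 22050 Hz should produce a single spike at frequency bin 440, since a 1 second window gives a resolution of 1 Hz per bin.

```python
import numpy as np
from scipy.fftpack import fft

sr = 22050
t = np.arange(sr) / sr              # exactly 1 second of sample times
tone = np.sin(2 * np.pi * 440 * t)  # pure A4 tone at 440 Hz

spectrum = np.abs(fft(tone))
peak_bin = int(np.argmax(spectrum[: sr // 2]))  # positive frequencies only
print(peak_bin)  # 440: with a 1 s window, bin k corresponds to k Hz
```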

We'll start by resampling our raw signal loaded with scipy.io.wavfile to 22050Hz, giving us signal values between -1 and 1.

I'm a bit shaky on the math here, but it looks like the formula used in the librosa.load() method is:

$$ signal_{rs} = \frac{signal}{2^{ (n_{bits} - 1)}} $$

In [20]:
# help with resampling scipy wavs to librosa format: https://stackoverflow.com/a/66675932
# the files are 16-bit, so we need to divide by 2 ** (nbits - 1) to resample to -1 to 1
n_bits = 16

signal_rs = genre_features["pop"]["signals"][0] / 2 ** (n_bits - 1)

# librosa resamples the signal to 22050Hz by default
y, sample_rate = lr.load(genre_features["pop"]["file_paths"][0])

all(signal_rs == y)
Out[20]:
True
In [21]:
def resample(signal, n_bits=16):
    return signal / 2 ** (n_bits - 1)

signal_rs = resample(genre_features["pop"]["signals"][0])

Now that our signal is represented by floating point values, we can use the FFT to convert the sample to the frequency domain.

We'll start with the whole sample, then look at a smaller segment of the sample.

In [22]:
signal_fft = np.abs(fft(signal_rs, n=22050*30))  # [:22050*10]  # 10 seconds

# whatever you do, don't plot the whole thing as a bar chart -- you'll break jupyter
ex.line(signal_fft, title='FFT Spectrum - pop index 0 ', labels={'index':'Frequency Bin', 'value':'Amplitude', "variable": "Signal"}).show()

As we can see, there is a lot of data plotted above. We risk breaking the notebook kernel by playing too much with these data. Let's look at a shorter segment of the signal.

Since we know the sample rate of the signal is 22050 Hz, we can look at the first 3 seconds of the song, which should be 66150 samples long.

In [24]:
22050*3
Out[24]:
66150
In [25]:
n_stft = 22050*3

stft = np.abs(lr.stft(signal_rs[:n_stft], hop_length = n_stft+1))

ex.line(stft, title='STFT Spectrum', labels={'index':'Frequency Bin', 'value':'Amplitude'})
In [23]:
play_song(genre_features["pop"]["file_paths"][0])
Your browser does not support the audio element.
In [83]:
def stft_over_time(signal, rate=22050, window=3, count=9, genre="pop"):

    fig = make_subplots(rows=3, cols=3)
    fig.update_layout(title_text=f"STFT over first {window*count} seconds - {genre}")
    
    for i in range(count):
        n_stft = rate*window*i 
        n_stop = rate*window*(i+1)

        if n_stop > len(signal) or i > 8:
            print("window extends past the end of the signal")
            break
        
        stft = np.abs(lr.stft(signal[n_stft:n_stop], hop_length = (rate*window)+1))
        fig.add_trace(
            ex.line(stft).data[0], 
            col=i%3+1, 
            row=i//3+1,
        )

        fig.update_xaxes(title_text=f'STFT Spectrum - ({window*i}, {window*(i+1)}) seconds', col=i%3+1, row=i//3+1)
    
    fig.update_layout(height=1000, showlegend=False)

    return fig.show()

stft_over_time(signal_rs)
In [88]:
def play_and_show_stft(genre="jazz", index=0):
    play_song(genre_features[genre]["file_paths"][index])
    stft_over_time(resample(genre_features[genre]["signals"][index]), genre=genre)
    return 

play_and_show_stft(genre="jazz", index=0)
Your browser does not support the audio element.

Oh, the jazz one was nice! The piano and sax switch off at roughly 3 second intervals. This clearly shows up in our STFT plots. Also, the notes are so crisp that we see singular spikes in the frequency domain, unlike the other genres, where human voices and many string and drum instruments are used.

In [90]:
play_and_show_stft(genre="disco", index=5)
Your browser does not support the audio element.
In [91]:
play_and_show_stft(genre="rock", index=7)
Your browser does not support the audio element.

Now that we can see how the frequency domain representation of the signal changes over time, we can start to grasp what these transformations are telling us. I find it useful to play the audio and notice what changes between transformed windows, and what sounds the same.

For example, in this first pop song, the last 5-7 seconds have background singers with lower energies but similar tone, plus the main singer still singing. This is likely why we see many more frequencies present in the last plot, since a harmony of voices likely contains more frequencies than a single voice. This makes me curious what a choir or orchestra would look like in the frequency domain.

Another thing to note is where the frequencies are concentrated. In all the plots, we see a large skew towards the lower frequencies, with very few higher frequencies present. This resembles the distribution of mel-scaled frequency filters, which we'll look at next.

MFCC¶

To convert the above FFTs or STFTs to MFCCs, we use the Mel-scaled filterbank. This is a set of 40 triangular filters applied to the Fourier transform of the signal. The filters are spaced on the Mel scale to correspond to how humans perceive sound.

Filter bank on Mel-Scale
(H. Fayek, 2016)

Notice how the filters are more dense at the lower frequencies, and less dense at the higher frequencies.

In our FFT and STFT plots above, we saw a large representation of lower frequencies, with the most prominent frequencies between 0-150 Hz. Since the Mel-scaled filterbank is more dense at the lower frequencies, one can assume the human auditory system is more sensitive to lower frequencies.

Taking this assumption one step further, let's think about the human voice. In most of the sampled audio, there is a female singer present. The average female voice tends to be in the frequency range of 165 to 255 Hz, which corresponds nicely with our STFTs.

This high density of lower frequencies in both the STFTs and the Mel-scaled filters is starting to feel intuitive. The human auditory system would likely have evolved to be more sensitive to the frequencies of other human voices.

While the conclusion drawn above seems logical, it could be a gross over-simplification. I'm still a beginner when it comes to these analyses, so I would want to research this further before making any claims with confidence.

Back to data exploration..¶

With some of the background intuition of mel-scales covered, we can now look into the MFCC features. These are usually what are used in audio classification ML tasks, since they represent a simplified approximation of the audio signal. For more on MFCCs and how they are derived from FFTs, see The Dummy's guide to MFCC - P. Nair (Medium).

In [18]:
genre_features["pop"].keys()
Out[18]:
dict_keys(['file_paths', 'rates', 'sigs', 'mfcc_feats', 'covariances', 'mean_matrices', 'labels'])
In [65]:
first_pop_mfcc_feat = genre_features["pop"]["mfcc_feats"][0]

print("tracks:",len(genre_features["pop"]["mfcc_feats"]))
print("shape:", first_pop_mfcc_feat.shape)
tracks: 100
shape: (2993, 13)
In [67]:
30/2993
Out[67]:
0.010023387905111928

In our data, we have 100 pop audio tracks, each with an array of MFCCs with shape 2993 by 13. Here, the ~661k signal values have been reduced to 2993 frames of MFCCs, each with 13 coefficients. This corresponds to roughly 220 samples per MFCC frame step.

I think of the MFCCs as approximations of the signal over fixed windows. Had the window been infinitesimally small, we would have the original signal. Instead, we have a reduced representation of the signal in frames about 30/2993 ≈ 0.01 seconds long.

Reducing the raw signal to the MFCC coefficients, we strip away the parts of the audio most human auditory systems fail to capture anyway, and focus mainly on the parts of the signal actually heard by humans. 13 is just an arbitrary number of coefficients to use, but it seems to be the standard. The larger the number of coefficients, the more information we can represent about the signal, but the more computationally expensive it is to process.

So here we restrict complexity in 2 default ways, for the sake of compute resources:

  • The reduced frame step of roughly 220 samples
  • The (only) 13 MFCC coefficients
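The 2993 frame count can be reproduced from the framing arithmetic (a sketch; I'm assuming here that python_speech_features rounds the 0.01 s step of 220.5 samples up to 221):

```python
import math

n_samples = 661504                  # samples in the first pop track
frame_len = round(0.020 * 22050)    # 441-sample analysis window (winlen=0.020)
frame_step = 221                    # assumed: 0.01 s * 22050 Hz = 220.5, rounded half-up

n_frames = 1 + math.ceil((n_samples - frame_len) / frame_step)
print(n_frames)  # 2993, matching the MFCC array shape above
```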
In [68]:
mfcc_df = pd.DataFrame(first_pop_mfcc_feat)
mfcc_df.head()
Out[68]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 68.742758 3.814073 -3.599784 7.833474 3.092980 -6.869419 -15.232205 -41.764331 -1.330596 0.125669 -19.437477 -5.084497 -9.885876
1 67.763316 3.256654 -9.460819 6.473814 2.475559 -15.623884 -21.465911 -46.288898 -3.123732 -9.419571 -17.208975 -15.410589 -16.520576
2 68.518080 5.578065 -5.352609 8.176695 3.527002 -8.393237 -11.223802 -35.368283 -0.261433 -8.908667 -8.953004 -2.914417 -6.305045
3 67.094697 3.996013 -6.427080 5.749396 -2.582816 -17.091698 -20.062623 -33.964676 -5.710885 -18.084208 -19.549037 -10.228777 -9.426628
4 66.490638 5.390578 -0.088159 8.870295 3.075847 -16.341286 -19.648581 -17.943840 -1.139059 -13.105216 -12.382698 -0.950495 7.765570
In [77]:
ex.imshow(mfcc_df.T, aspect='auto', origin='lower', title="MFCC visualized", labels={"x": "frame (~10 ms)", "y": "MFCC coefficient"})
In [ ]:
# more analysis on MFCCs here: https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html

Spectrogram¶

A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. It is a 2D plot of the frequency domain of the signal over time. The colors represent the intensity of the frequencies present in the signal.

Comparing a spectrogram to the MFCCs, we can see that the MFCCs are a simplified representation of the spectrogram: rather than a full 2D array of frequencies, they are a reduced set of coefficients that approximate the signal over each fixed window.

In [76]:
def log_specgram(audio, sample_rate, window_size=20, step_size=10, eps=1e-10):
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, times, spec = signal.spectrogram(audio,
                                    fs=sample_rate,
                                    window='hann',
                                    nperseg=nperseg,
                                    noverlap=noverlap,
                                    detrend=False)
    return freqs, times, np.log(spec.T.astype(np.float32) + eps)

def plot_log_specgram(audio, sample_rate, window_size=20, step_size=10, eps=1e-10):
        freqs, times, spectrogram = log_specgram(audio, sample_rate, window_size=window_size, step_size=step_size, eps=eps)
        ex.imshow(
            spectrogram.T, 
            aspect='auto', 
            origin='lower',
            # extent=[times.min(), times.max(), freqs.min(), freqs.max()]
            title='Spectrogram', 
            labels={"x": "centiseconds", "y": "Freqs in Hz"}
        ).show()
        return freqs, times, spectrogram

freqs, times, spectrogram = plot_log_specgram(signal_rs, sample_rate=22050)
In [78]:
# previous mfcc visualized
ex.imshow(mfcc_df.T, aspect='auto', origin='lower', title="MFCC visualized", labels={"x": "frame (~10 ms)", "y": "MFCC coefficient"})
In [92]:
time = np.linspace(0, 30, len(mfcc_df))
mfcc_df.set_index(time).plot()
In [ ]:
# These are the different mel-frequency cep. coefficients, varying over time. 
# I wonder why 0 is not around 0 (ish) but the others are? 

# Cool, seems natural to plot MFCCs as spectrograms and compare them against spectrograms of the signal.
# They should look like coarser versions of the spectrogram

# Next steps could maybe be to take the mean normalization of both plots and compare them?
# filter_banks -= (numpy.mean(filter_banks, axis=0) + 1e-8)
# mfcc -= (numpy.mean(mfcc, axis=0) + 1e-8)

Covariance¶

In [55]:
first_pop_covariance = genre_features["pop"]["covariances"][0]
In [56]:
pd.DataFrame(first_pop_covariance).plot()
In [61]:
sig
Out[61]:
array([ 1131,  1578,  2107, ..., -2279, -4733, -6299], dtype=int16)
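A quick sketch (on random data standing in for one track's MFCCs) of why the covariance matrices all come out 13 by 13 regardless of track length: np.cov over the transposed MFCC array treats each of the 13 coefficients as a variable and each frame as an observation.

```python
import numpy as np

rng = np.random.default_rng(0)
fake_mfcc = rng.normal(size=(2993, 13))  # stand-in for one track's MFCC array

# 13 variables (coefficients), 2993 observations (frames)
cov = np.cov(fake_mfcc.T)
print(cov.shape)  # (13, 13), no matter how many frames the track has
```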

Mel frequency cepstral coefficients (MFCCs)¶

Audio features are classified into 3 categories: high-level, mid-level, and low-level features.

  • High-level features: genre, mood, instrumentation, rhythm, lyrics, chords
  • Mid-level features: beat-level attributes, pitch-like fluctuation, MFCCs
  • Low-level features: energy, zero-crossing rate, timbre, loudness, etc.

Audio feature levels (Ramaseshan, 2013)

The process of extracting MFCCs (mid-level and low-level features) is as follows:

  1. Take the Fourier transform of (a windowed excerpt of) a signal.
  2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
  3. Take the logs of the powers at each of the mel frequencies.
  4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
  5. The MFCCs are the amplitudes of the resulting spectrum.